Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduces manifest.yaml that is the "last working world state" #405

Merged
merged 30 commits into from
Dec 18, 2023

Conversation

terrykong
Copy link
Contributor

@terrykong terrykong commented Nov 30, 2023

This introduces "the manifest" (manifest.yaml) that describes the complete state of the jax stack to allow for reproducible nightly builds and reproducible presubmit CI. The manifest (and patches) are staged in a "trial branch" each night, and if the build succeeds, we can merge the trial branch into main (TODO: GH Issue tracking automating merge). The presubmit CI runs on the PR's git-ref and so unless the author has committed a custom patch/manifest, they will always be running from the "last working state".

Description of manifest.yaml

The manifest allows specifying/pinning libraries that serve different purposes:

  • git repos: libraries that need their source checked out at a particular SHA
  • pip VCS constraints: VCS dependency specs that need to be pinned like fiddle @ git+https://github.com/google/fiddle
  • pip constraints: Python packages we need to constrain or pin to address one-off fixes, e.g., addressing a CVE

Here are example entries of each of these in the manifest.yaml

Git repos

jax:
  url: https://github.com/google/jax.git
  tracking_ref: main
  latest_verified_commit: b032a0271e3e2ea8d0df64d2f3f1a1e450a38dc9  # 2023-11-15
  mode: git-clone
  • url: This is the url to lookup the latest git ref
  • tracking_ref: This is the git-ref to lookup the latest commit SHA. Usually it’s main, but in some cases like lingvo, it’s master
  • latest_verified_commit: SHA
  • mode:
    • git-clone: Whether to clone the repo or not

Git repos with patches

flax:
  url: https://github.com/google/flax.git
  mirror_url: https://github.com/nvjax-svc-0/flax.git
  extra_dir: null
  tracking_ref: main
  latest_verified_commit: a572f6af2fef565c0f9ba2fc12b781e9e3385140
  mode: git-clone
  patches:
    pull/3340/head: file://patches/flax/PR-3340.patch # Add Sharding Annotations to Flax Modules
  • mirror_url: a git url where we can pull patches that start with ^mirror/
  • extra_dir: an optional local dir to pull patches not found in upstream or the mirror. Useful for private repos
  • patches: a dictionary where keys are git-refs and values are URIs for the local patch. The reason for the value is the git-ref can be rebased or be updated and the SHA will change; this makes patches a moving target when using git refs. So each file:// URI will be a patch committed into Jax-Toolbox's VC to ensure reproducibility

VCS Constraint

clu:
  url: https://github.com/google/CommonLoopUtils.git 
  tracking_ref: main
  latest_verified_commit: 89c2face3474a7482358068d7a00d9bb6e4b31fe
  mode: pip-constraint

These aren't cloned (in fact get-source.sh will error if you try to clone). These are used in pip-finalize.sh to pin VCS dependencies like

clu @ git+https://github.com/google/CommonLoopUtils#egg=clu

# To

clu @ git+https://github.com/google/CommonLoopUtils@89c2face3474a7482358068d7a00d9bb6e4b31fe

Pip constraint

pydantic:
  mode: pip-constraint
  version: X.Y.Z
  • mode: pip-constraint
    • Adds a pip constraint during pip-compile
  • version: If ${package}.version is present, then we can treat this as a pip constraint. This allows us to hotfix any python package in case we discover a bug or CVE.

Changes to the CI

  • Nightly JAX Build bumps the manifest.yaml and patches and commits them to a trial branch. If the tests pass, this trial branch should be merged to main
    • introduces bump.sh that bumps the world state given the manifest
  • No more REPO_* and REF_* build args, since they are all specified in the manifest.yaml
  • get-source.sh and create-distribution.sh now take the manifest
  • Rename pip-tools "manifests" to "requirements-*.in` to avoid confusion and use the same terminology as pip-tools

Not addressed in this PR

  • Automating the merging of the trial branch at the end of a successful nightly run
  • Running presubmit on latest world state (currently only will run on last-known working state)
  • Move lingvo patches to the manifest (Move lingvo patches into the manifest #409 )
  • Move xla arm neon patch to the manifest
  • Add a cleanup.sh script to remove things like ~/.gitconfig
  • Have not implemented mode: pip-constraint since there was no dep that needed a constraint

@yhtang
Copy link
Collaborator

yhtang commented Nov 30, 2023

The history looks a bit weird. Maybe a rebase is needed? 😀

@yhtang
Copy link
Collaborator

yhtang commented Nov 30, 2023

Some nitpicks mostly around naming:

  • Would ref be better named latest_verified_commit or something along the line?
  • mirror_url -> mirrors: turn it into a list instead of a single string?
  • extra_dir -> patch_dir to be more explicit?

@DwarKapex DwarKapex requested review from yhtang, ashors1, joker-eph and DwarKapex and removed request for yhtang and ashors1 November 30, 2023 20:13
@terrykong
Copy link
Contributor Author

@yhtang

Some nitpicks mostly around naming:

  • Would ref be better named latest_verified_commit or something along the line?
  • mirror_url -> mirrors: turn it into a list instead of a single string?
  • extra_dir -> patch_dir to be more explicit?

Re: latest_verified_commit
SGTM. That sounds more clear

Re: mirror_url
I only envisioned one mirror being used. Did you have a use case for a list (multiple mirrors)?

Re: extra_dir Maybe I could call this mirror_dir to be more clear? It's expected to be a locally cloned mirror. This circumvents the need to git remote add a private repo, so it's supposed to point to a local git repo as opposed to a patch dir. Thoughts on mirror_dir? FWIW, the local patches dir is always located in the same dir as the manifest

ADD install-pax.sh /usr/local/bin
ADD install-flax.sh /usr/local/bin
ADD install-te.sh /usr/local/bin
# update TE manifest file to install the [test] extras
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why here but not in Dockerfile.jax? TE is a part of JAX image now, so I would assume someone wants to test TE in JAX container.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind. Any reason you kept it separate @yhtang ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I took a look at it, and it appears we could merge the two without any issue since there's just a sed in Dockerfile.pax.amd64 that replaces the line, so I think it'd be fine to do that in requirements-jax.in

Any objections? @DwarKapex @yhtang

Actually, another issue I found looking at this is that Dockerfile.pax.arm64 doesn't include this sed. Is that intentional @yhtang ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. As I mentioned during the CI sync, IIRC the [test] extra could not be installed in ARM64.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can address this as part of #338.

.github/container/bump.sh Outdated Show resolved Hide resolved
.github/container/bump.sh Outdated Show resolved Hide resolved
.github/container/bump.sh Outdated Show resolved Hide resolved
@DwarKapex
Copy link
Contributor

Try to rebase/merge to main again (or create another clean PR), because it is pretty difficult to get an idea what changes are yours vs what changes came from add-pip-compile PR

@terrykong
Copy link
Contributor Author

Try to rebase/merge to main again (or create another clean PR), because it is pretty difficult to get an idea what changes are yours vs what changes came from add-pip-compile PR

SG. I've rebased (preview: #406 ), but it was a pretty gnarly rebase. Without reparenting, just squashing had conflicts. I need to test that I haven't broken anything, but once I've tested, I'll force push here

@terrykong
Copy link
Contributor Author

terrykong commented Dec 8, 2023

RUN mkdir -p /opt/pip-tools.d
ADD --chmod=777 \
get-source.sh \
pip-finalize.sh \
/usr/local/bin/
RUN wget https://github.com/mikefarah/yq/releases/latest/download/yq_linux_$(dpkg --print-architecture) -O /usr/local/bin/yq && \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have a guideline yet regarding third-party PPAs. Let's stick with this simple wget for the time being.

ADD install-pax.sh /usr/local/bin
ADD install-flax.sh /usr/local/bin
ADD install-te.sh /usr/local/bin
# update TE manifest file to install the [test] extras
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. As I mentioned during the CI sync, IIRC the [test] extra could not be installed in ARM64.

ADD install-pax.sh /usr/local/bin
ADD install-flax.sh /usr/local/bin
ADD install-te.sh /usr/local/bin
# update TE manifest file to install the [test] extras
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can address this as part of #338.


# build lingvo
RUN <<"EOT" bash -exu
set -o pipefail
RUN <<"EOF" bash -exu -o pipefail
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used EOT in consistency with Docker's official guide. This way, you don't have to rename the inner EOF to EOFINNER.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean we don't have to rename, but I feel that it's more clear since a question already came up about its significance. This way the identifier is also self-documenting

echo "-e file://${SRC_PATH_PRAXIS}" >> /opt/pip-tools.d/manifest.pax
echo "tensorflow==2.13.0" >> /opt/pip-tools.d/requirements-paxml.in
echo "tensorflow_datasets==4.9.2" >> /opt/pip-tools.d/requirements-paxml.in
echo "chex==0.1.7" >> /opt/pip-tools.d/requirements-paxml.in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding EOT vs EOF, EOT is what Docker's official documentation used for demoing heredoc RUN. But ultimately this choice is arbitrary and up to us to reach a consensus.

@@ -105,6 +100,28 @@ jobs:
BASE_IMAGE: ${{ needs.metadata.outputs.BASE_IMAGE_ARM64 }}
secrets: inherit

publish-build-badge:
needs: [metadata, amd64, arm64]
uses: ./.github/workflows/_publish_badge.yaml
Copy link
Collaborator

@yhtang yhtang Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_publish_badge.yaml is deprecated. In the future, we will generate Badge JSONs as part of the job sitrep artifact, and then collectively publish them using _fianlize.yaml.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should move to sitrep, but given how large this PR is, I'd like to tackle that in follow-up PR. This shouldn't change the status-quo; in fact it should fix things b/c this badge is referenced on the front-page readme and it was somehow removed in a PR so it hasn't been getting updated

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I just wanted to raise the issue so that you are aware of the coming change.

publish-build-badge:
needs: [metadata, amd64, arm64]
uses: ./.github/workflows/_publish_badge.yaml
if: always()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of always(), use !cancelled() to prevent the step from running even if the workflow is cancelled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, I think we want always(). If upstream workflows fail, then this workflow is canceled, but if the badge creation only happens if !cancelled(), then the badge does not get updated, and we see yesterday's badge.

uses: ./.github/workflows/_publish_badge.yaml
if: ( always() )
if: always()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"!cancelled()"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

responded on other thread

# TODO: ARM
publish-build-badge:
needs: [metadata, amd64, arm64]
uses: ./.github/workflows/_publish_badge.yaml
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment for _publish_badge.yaml.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

responded on other thread

needs: [metadata, amd64, arm64, test-unit, test-t5x]
# TODO: ARM
publish-test-badge:
needs: [metadata, amd64, test-unit-amd64, test-t5x-amd64]
uses: ./.github/workflows/_publish_badge.yaml
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment for _publish_badge.yaml.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

responded on other thread

@yhtang
Copy link
Collaborator

yhtang commented Dec 12, 2023

I feel that the way that we pass the version bump targets around could be better streamlined. The current trial branch approach seems a little risky. Admittedly, this could be done much more easily if the presubmits and the nightlies can be converged first.

@terrykong
Copy link
Contributor Author

I feel that the way that we pass the version bump targets around could be better streamlined. The current trial branch approach seems a little risky. Admittedly, this could be done much more easily if the presubmits and the nightlies can be converged first.

@yhtang I don't think there's any risk b/c the ability to merge the trial branch isn't included in this change-set. Nightlies will continue to be built via bump.sh. If nightlies and presubmits converged, I agree this would be more easily done, but I think we can address that after this is in. If we wait for the converge PR, that'll add +N weeks and then this change needs to be rebased after N weeks of changes and that seems like avoidable dev-work.

I am in favor of merging this in as it works and the converge PR can make use of the mechanism I am introducing here.

@@ -197,7 +197,6 @@ t5x/contrib/gpu/scripts_gpu/singlenode_ft_frompile.sh \

# Known Issues
* There is a known sporadic NCCL crash that happens when using the T5x container at node counts greater than or equal to 32 nodes. We will fix this in the next release. The issue is tracked [here](https://github.com/NVIDIA/JAX-Toolbox/issues/194).
* The T5x nightlies disable `NCCL_NVLS_ENABLE=0` ([doc](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-nvls-enable)). Future releases will re-enable this feature.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: this was mistakenly deleted during rebase (revert)

@yhtang
Copy link
Collaborator

yhtang commented Dec 18, 2023

I'll move to merge this PR once the new tests are done after the rebasing. Follow-up work could include:

  1. Converge the pre-submit and nightly workflows, which will automatically solve the following issues
    • Existing issue: the logic to decide whether to publish a nightly badge depends on upstream status (hidden dependency)
    • New issue: a workflow can use a manifest file from a branch other than the one that it is triggered on
  2. Avoid creating trial branches for pre-submits. Currently, a trial branch seems necessary if we want to test the edge/latest versions. This will be problematic if we matricize the CI, considering the number of trial branches that will be created by each
    matrix.
    • Related issue: the name of a trial branch is passed between various nightly workflows. A better approach is to always use the manifest file from the same branch that triggers the workflow. This should be fixed automatically if we complete (1).
  3. Allow partial bump of the manifest. For example, if T5X test failed, it should not block the version bump for PAX unless the source of the error is in the common ancestor of the two builds.

@yhtang
Copy link
Collaborator

yhtang commented Dec 18, 2023

Merging as the remaining tests are all matrix MGMN tests (with partially confirmed success) and there are ~200 backlog jobs in the queue at this moment.

@yhtang yhtang merged commit fd58524 into main Dec 18, 2023
98 of 100 checks passed
@yhtang yhtang deleted the the-manifest branch December 18, 2023 23:17
yhtang added a commit that referenced this pull request Dec 19, 2023
There was a YAML indentation issue when I last time rebased #405 and it
ended up causing the comment `#` to enter the image list when we publish
PAX nightly images.
olupton added a commit that referenced this pull request Jan 4, 2024
As of #405 the presubmit CI is based on package versions listed in the
`manifest.yaml` that is committed to the repository. This has not been
updated for ~1 month, so the presubmit CI is testing ~1 month old
versions of the ecosystem. This PR updates it using the commit from
https://github.com/NVIDIA/JAX-Toolbox/tree/znightly-2024-01-03-7395605285,
generated by the nightly CI run.

Because this bumps the JAX version by ~1 month, we have to include fixes
for deprecations. In particular replacing `jax.random.KeyArray` with
plain `jax.Array`
(nvjax-svc-0/t5x@4d5ec2f).

The deprecated name is used in older versions of the `chex` package,
which are being selected by pip's dependency resolver despite newer
versions being available. We avoid this by giving pip a helping hand and
nudging it to use a newer `numpy` version, which allows it to select a
newer `chex`. But it's easy to imagine similar issues in future with
other packages.

Closes #448.
yhtang pushed a commit that referenced this pull request Jan 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants